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Summary 

The  IPFP  can  be  viewed  as  a  method  for  maximizing  the  likelihood  for 
certain  loglinear  models  or  equivalently  for  minimizing  the  Kullback-Leibler 
Information  between  two  probability  densities.  Both  of  these  viewpoints 
lead  to  natural  generalizations  of  the  classical  IPFP.  We  examine  the 
generalizations  suggested  by  the  work  of  Csiszar  (1975),  Darroch  and 
Ratcliff  (1972),  and  Haberman  (1974)  and,  with  the  aid  of  the  theory, 
explore  a  practical  example  of  expanding  a  contingency  table. 

Key  words  and  phrases:  Generalized  Iterative  Scaling;  I-divergence; 
Kullback-Leibler  information  number;  Contingency  tables. 


Introduction 

There  are  many  ways  of  calculating  maximum  likelihood  estimates  of  mean 
values  for  logl inear  models.  The  two  most  popular  methods  are  the  Xterative 
Proportional  Fitting  Procedure  (IPFP)  and  variants  upon  Newton's  method. 

Newton's  method  has  many  desirable  properties,  including  its  quadratic  con¬ 
vergence  rate  near  the  maximum  and  the  ability  to  calculate  estimates  of 
asymptotic  covariance  matrices  as  a  by-product  of  the  computations.  Its 
principle  disadvantage  is  a  computational  one  in  that  the  method  requires 
a  considerable  amount  of  storage  and  is  thus  limited  by  the  size  of  the  design 
manifold  being  fitted.  In  many  situations  it  thus  becomes  necessary  to 
consider  alternatives  to  Newton's  method. 

The  Iterative  Proportional  Fitting  Procedure  is  an  alternative  method 
for  fitting  many  classes  of  loglinear  models.  Although  the  method  is  often 
slow  to  converge  it  requires  little  storage.  In  our  experience  it  is  often 
the  storage  requirements  of  an  algorithm,  as  opposed  to  the  computational 
time  required,  that  limit  the  algorithm's  usefulness.  The  classical  IPFP 
(see,  e.g..  Bishop,  Fienberg,  and  Holland  (1975))  is  limited  in  the  type  of 
models  which  may  be  fitted.  As  many  applications  of  the  loglinear  model 
methodology  now  use  models  other  than  simple  factorial  situations,  we  seek 
generalizations  of  the  IPFP  which  extend  its  capabilities  to  any  loglinear 
model  while  preserving  the  desirable  properties  of  the  classical  IPFP. 

There  are  at  least  three  generalizations  of  the  IPFP.  Haberman  (1974) 
shows  that  the  IPFP  is  really  just  a  special  case  of  the  method  of  cyclic 
ascent  for  functional  maximization.  This  observation  immediately  leads  to 
an  algorithm  defined  for  any  loglinear  model.  Csiszar  (1975)  considers  the 
IPFP  as  a  method  for  minimizing  the  Kul Iback-Leibler  information  (or  I-di vergence) 
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between  two  probability  densities.  When  specialized  to  distributions  on 
finite  sets,  Csiszar's  methods  yield  another  type  of  IPFP.  In  Section  2, 
we  show  that  these  methods  and  those  of  Haberman  are  closely  related  and 
yield  equivalent  procedures  in  some  situations.  A  third  generalization  of 
the  classical  IPFP,  discussed  in  Section  3,  is  the  Generalized  Iterative 
Scaling  (GIS)  method  of  Darroch  and  Ratcliff  (1972).  This  generalization 
is  also  developed  in  the  setting  of  minimizing  I-divergence  but  does  not 
appear  to  be  related  to  the  other  methods. 

The  impetus  for  this  report  came  from  an  example  of  expanding  a  contin¬ 
gency  table  into  a  more  manageable  structure  which  appeared  in  Fienberg  and 
Wasserman  (1980).  Section  4  is  devoted  to  a  discussion  of  this  and  similar 
examples. 
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2.  The  Results  of  Haberman  and  Csiszar  on  the  IPFP 

The  Iterative  Proportional  Fitting  Procedure  is  generally  considered  as 
a  method  for  obtaining  the  maximum  likelihood  estimates  for  the  mean  value 
parameter  of  a  loglinear  model  for  a  contingency  table.  The  formulation  of 
the  IPFP  considered  by  Csiszar  is  presented  as  a  problem  of  minimizing  the 
Kullback-Leibler  information  number  between  two  Probability  Distributions 
(P.D.'s).  Although  we  shall  only  use  P.D.'s  defined  on  a  finite  set,  it  is 
instructive  to  outline  Csiszar' s  very  general  formulation  and  specialize  the 
results  as  the  need  arises. 

Haberman  (1974,  pp.  64-73)  noted  that  the  classical  IPFP  is  a  version 
of  the  cyclic  ascent  method  of  functional  maximization  and  suggested  the 
extension  to  arbitrary  loglinear  models.  The  methods  of  Haberman  and  Csiszar 
can  be  viewed  as  dual  methods  although  strictly  speaking  the  algorithms  dual 
to  Csiszar's  encompass  a  much  wider  class  of  maximization  techniques.  We 
shell  concentrate  on  stating  results,  illustrating  the  ideas  with  examples. 
However  we  should  note  that  Csiszar  presents  very  elegant  proofs  by  developing 
a  "geometry"  for  the  information  measure.  The  geometric  ideas  have  (and 
were  perhaps  developed  from)  a  strong  analogy  with  results  in  finite  dimen¬ 
sional  Hilbert  spaces.  We  now  turn  to  a  detailed  discussion  of  the  techni¬ 
ques. 

Let  N,  P,  Q,  R,  S,  T  denote  Probability  Distributions  on  a  measure 
space  (X,lt).  In  our  applications  X  will  be  a  finite  set  and  *■  the  power 
set  of  X.  If  P  is  absolutely  continuous  with  respect  to  Q  (written  as  P 
a.c.  Q)  we  will  denote  the  corresponding  density  by  p^.  The  Kullback-Leibler 
information  number  (or  I-divergence,  or  information  in  P  about  Q),  I(P||Q), 
is  defined  to  be 


When  P  and  Q  are  both  a.c.  N  then  (2.1)  may  be  written  as 

(2.2)  I(P|  |Q)  */pN  In  (pN/qN)dN. 

In  the  above  formulae  we  use  the  conventions  that.  In  (0)  »  0  •  (♦•»)*  0, 

and  In  (r/0)  =  +  «  when  r  e  (0,»). 

In  the  special  case  where  N  is  the  P.D.  which  assigns  equal  weight  to 
each  point  of  a  finite  set  X,  then  all  P.D.'s  on  X  are  absolutely  continuous 
with  respect  to  N.  Unless  otherwise  indicated  (e.g.,  by  the  use  of  some 
subscript)  all  densities  on  finite  sets  will  be  with  respect  to  this  uniform 
N  and  the  probability  function  pN(x)  will  be  written  as  p(x).  In  this 
situation  equation  (2.2)  becomes: 

(2.3)  I (P |  IQ)  = -|yy  •  r  p(x)  In  (p(x)/q(x) ) . 

xf  X 

We  next  consider  the  I-sphere,  ^  ,  with  center  R  and  radius  o,  defined 
by: 

(2.4)  i(R,p)  =  {P  :  I ( P |  |R)  <p),  pc  (0,-]. 

The  I-sphere, ,  contains  P.D.'s  which  are  close,  in  the  information  sense, 
to  a  given  P.D.  If  S  is  a  convex  set  of  P.D.'s  such  that  £  n  (R,«)  f  0,  then 
a  P.D. ,  Q  t  £,  satisfying 

(2.5)  I(0| |R)  *  min  I ( P | |R) , 

Pe£ 

is  called  the  I-projection  of  R  on  £  and  will  be  denoted  by  Q  =  Pg(R). 

The  convex  set  £  of  P.D.'s  is  called  a  linear  set  if  when  P  and  Q  are  in  £ 
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and  T  =  aP  +  (l-a)Q,  («  <  P)  is  a  P.D.  then  T  is  also  in  S.  Csiszar  gives 
conditions  under  which  P^(R)  exists  (it  is  always  unique)  and  develops  a 
geometry  of  I-divergence  by  using  an  analogue  of  Pythagoras'  Theorem. 

As  our  goal  is  to  study  maximum  likelihood  estimation  in  contingency 
tables,  we  turn  briefly  to  the  problem  of  estimating  a  multinomial  probability 
function  with  an  underlying  logl inear  model  for  the  probabilities.  Consider 
a  multinomial  random  vector,  z(x),  of  Z  counts  on  the  set  X,  with  mean 
(m(x)  :  x  e  X)  where  m(x)  =  Z*p(x)  and  p(x)  is  the  probability  an  observa¬ 
tion  falls  in  cell  x  «  X.  The  vector  s(x)  =  z(x)/Z  is  an  observed  probability 
function  on  X.  The  log-likelihood  of  the  data,  z,  given  the  assumed  mean, 
m,  will  be  denoted  by  £{m;z).  A  loglinear  model  for  the  probability  function 
(equivalently  for  the  mean  vector)  asserts  that  the  logarithm  of  the  underlying 
probability  function  is  an  element  of  some  linear  manifold, 7TI ,  i.e.. 

In  (p(x))  e'Yn  . 

It  is  well  known  (see  e.g.,  Haberman  (1974))  that  the  maximum  likelihood 
estimator,  p,  of  p  based  on  the  observed  probability  function,  s,  satisfies 

(i)  In  (p)  < TH.  , 

(2.6) 

(ii)  p  -  s  e  rm.x , 

and  minus  the  log- likelihood  ratio  of  p  compared  with  s  is  proportional  to 

(2.7)  E  s(x)  In  (s(x)/p(x) )  *  I(S||P). 
x«  X 

The  problem  of  minimizing  I-divergence  (i.e.,  equation  (2.5))and 
maximizing  the  log-likelihood  ratio  (i.e.,  equation  (2.7))appear  similar. 

All  the  same, the  relationship  between  the  two  methods  is  not  clear.  In 
order  to  show  that  the  two  problems  lead  to  identical  estimates  we  need  to 
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envoke  a  result  of  Csiszar  (due  originally  to  Kullback  (1959)),  giving  the 
form  of  the  density  of  the  I-projection.  Csiszar's  Theorem  3.1,  which  we 
state  below,  gives  conditions  under  which  the  probability  function  of  the  I- 
projection  satisfies  a  logl inear  model.  The  examples  at  the  end  of  this 
section  and  the  discussion  after  the  theorem  should  help  to  clarify  the 
notation. 


Theorem  1  (Csiszar  (1975),  p.  152). 

Let  3  *  (f  :  y  «  D  be  a  set  of  real  valued  measurable  functions 
Y 

on  X  and  •  {a^  :  y  e  r}be  real  constants.  Let  £  be  the  (linear)  set  of 
all  probability  distributions,  P,  on  (X,3t)  for  which  the  integrals, 

Jf^dP  exist  and 

/ f  dP  *  a  ;  y  «  r  . 

J  y  y 

Then  if  a  P.D.  R  has  I-projection  Q  on  £,  the  density  of  Q  with  respect  to 
R  is  of  the  form 

(2.8)  qR(x)  =  c.  exp  (g(x) )  x  e  M 
=  0  x  4  M 

where  P(M)  =  0  for  every  P  in  £  n^6(R,«)  and  g  belongs  to  the  closed  sub¬ 
space  of  Lj(Q)  spanned  by  the  ^Y's. 

Conversely  if  a  Q  e  &  has  a  density  with  respect  to  R  of  the  form  (2.8) 


where  g  belongs  to  the  linear  space  spanned  by  the  f^'s 
projection  of  R  on  £.  ■ 


then  Q  is  the  I- 
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In  our  applications  of  this  theorem  X  will  be  a  finite  set.  Let  S 
be  an  observed  P.D.  on  X  and  consider  a  set  of  functions  J  =  {f  :  y  e  r) 
which  span  a  linear  manifold  'flrij  .  The  set  of  constants  *4  =  (a^  :  y  *  r) 
will  be  defined  by 

aY  =/VS  =/fYSNdN-  Y  e  r  * 

That  is  the  a^  are  the  “marginal  sums"  of  the  observed  probability  function. 

We  will  call  the  set  ^.determined  by  an  observed  P.D.  S  and  the  functions 
3  ,  the  1  -margins  of  S.  The  set  £  of  the  theorem  is  the  set  of  all  P.D.'s, 

P,  such  that  Jf  dP  =  a^  ,-Jvs  for  all  y  in  r.  In  other  words  the  set  £ 
consists  of  those  P.D.'s  which  have  the  same  3  -margins  as  the  observed 
P.D.  S.  This  is  in  turn  the  same  as  the  set  of  probability  functions,  p, 
such  that  s-p  is  in  .  The  conclusion  of  the  theorem  (equation  (2.8)) 
says  that  if  Q  »  P&(R)  then  qR,  the  density  with  respect  to  R,  satisfies 

In  (qR(x) )  e  "Wj  . 

If  Q  a.c.  N  then  this  is  the  same  as  saying  that 
(2.9)  In  (qN(x) )  «  +  In  (rN). 

Equation  (2.9)  says  that  the  log  probabilities  of  the  I-projection  lie  in 
an  affine  subspace  of  Note  that  if  In  (rN)  is  in  then  Wj  +  In  (r^) 

=  7Tlj  and  the  log  probabilities  lie  in  a  linear  manifold.  When  R 
has  a  density  rN  w.r.t.  N  and  In  (rN)  is  inlDj,  then  the  I-projection 
is  seen  to  satisfy  part  (i)  of  equation  (2.6)  for  the  manifold 

In  the  above  development  we  have  restricted  our  attention  to  those  P.D.'s 
whfch  had  the  same  3  -margins  as  the  observed  P.D.  S.  In  other  words  condition 
(fi)  of  equation  (2. 6), which  required  that  0-s  be  in'Wj.is  satisfied  for  all  p 
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in  £  and  in  particular  for  the  I-projection  q^.  The  conclusion  is  that  qN 
satisfies  all  the  conditions  required  of  the  M.L.E.  and  is  thus  the  M.L.E. 

An  alternative  demonstration  of  this  fact  comes  from  an  argument  due  to 
Darroch  and  Ratcliff  (1972),  which  shows  directly  that  the  likelihood  is 
maximized  by  the  I-projection. 

A  purely  mathematical  interpretation  of  this  result  is  that  the  problem 
of  minimizing  I-divergence  subject  to  linear  constraints  is  the  convex  (or 
Fenchel )  dual  problem  to  maximizing  the  likelihood  subject  to  logl inear 
constraints.  For  further  references  to  this  topic  see  Rockafeller  (1974) 
or  Luenberger  (1969).  We  will  use  the  theorem  with  many  different  spaces  £ 
to  demonstrate  the  duality  between  the  Iterative  Proportional  Fitting  Pro¬ 
cedures  of  Csiszar  and  Haberman. 

To  illustrate  the  preceeding  ideas  we  present  a  simple  example,  where 
duality  and  results  are  well  known. 

Example  1 . 

Consider  Z  observations  cross  classified  according  to  their  response 
level  on  two  factors,  A  and  B,  each  with  3  levels.  We  assume  that  the  data 
can  be  considered  as  Z  independent  multinomial  trials.  The  observed  table 
of  data  is 


B 

:n 

Z12 

Z1 3 

:21 

z22 

CO 

CVJ 

IM 

31 

z32 

Z33 

[I] 


1 


1 


0 


n 


The  maximum  likelihood  estimator,  p,  of  p  maximizes  fc(p;s)  subject 
to  1  n CP )  being  in'Mj.  This  is  the  same  as  maximizing  the  log-likelihood 
ratio,  i .e. , 


£  «  (P.D.  's  P  s.t.  e  f^i.jlp,.  =  a*}. 

ij  ,J  1 

Csiszar's  Theorem  3.1  tells  us  that  the  M.L.E.,  p,  for  the  loglinear  model 
corresponds  toPg(R),  the  I-projection  of  R  onto  £,  for  any  R  which  has  In  (r) 
in  Iflf. .  A  simple  R  which  satisfies  this  is 
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1/9 

1/9 

1/9 

1/9 

1/9 

1/9 

1/9 

1/9 

1/9 

which  assigns  the  same  probability  to  each  cell  in  the  table.  ■ 

Thus  far  we  have  not  given  any  techniques  for  calculating  I-projections. 
In  the  following  paragraphs  we  hope  to  rectify  this  situation  by  presenting 
some  forms  of  the  Iterative  Proportional  Fitting  Procedure. 

Recall  that  a  set  of  P.O.'s,  £,  is  called  linear  if  when  Pj  and  P^ 
are  in  £  and  T  =  aP^  +  (l-ajPg,  a  «  F  is  a  P.D.,  then  T  is  also  in  £.  We 
note  that  if  £  is  defined  by  a  set  of  constraints,  \  ,  and  corresponding 
constants,  <A  ,  then  £  is  a  linear  set.  In  particular  the  maximum  likelihood 
estimation  problem  for  loglinear  models  can  be  posed  in  terms  of  linear 
sets  of  P.O.'s.  Csiszar  presents  results  which  enable  one  to  build  up  the 
total  I-projection  onto  £  cyclically, by  forming  the  I-projections  onto 
other  (and  hopefully  simpler)  spaces,  £^ .  The  statement  of  Csiszar's 
Theorem  3.2  follows. 


Theorem  2  (Csiszar  (1975),  p.  155). 

Let  £,,...,£.  be  arbitrary  linear  sets  of  P.D.'s  on  a  finite  set  X 
k 

with  £  =  n  £.  f  0,  let  R  be  a  P.0,  for  which  there  exists  a  Pi  £ 
i=l  1 

with  P  a.c.  R  and  define  Qj,  Qg, _  recursively  by  letting  Qn  be  the  I- 

projection  of  Qn_-|  on  £n,  n  =  1,2,3, _ where  Qq  3  R  and  £n  *  if  i  *  n 

mod  k.  Then  Qn  converges  (pointwise)  to  the  I-projection,  Q  *P&(R).  ■ 
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Some  analogies  between  this  theorem  and  similar  theorems  about 
projection  operators  on  Hilbert  spaces  are  presented  in  Appendix  1. 

Before  proceeding  to  the  examples,  we  make  two  simple  obser¬ 
vations.  First,  if  are  sets  of  constraint  vectors  with  corres¬ 

ponding  sets  of  constants  «4p...,W^  which  together  determine  linear  sets 

k  H  k  , 

£,,...,  then  the  sets  u  and  u  *4  .  together  determine  the  linear 

k  i=l  1  i=l  1 

set.  n  £  in  other  words  more  constraints  (the  union  of  the  7  leads 
i=l  1  1 

to  a  more  restricted  or  smaller  linear  set  (the  intersection  of  the 

Our  second  remark  concerns  a  special  case  of  Csiszar's  Theorem 
3.2.  Let  3  ^  and  be  sets  of  functions  on  X  with  corresponding  sets  of 
constants,  *4^  and  We  will  assume  that  the  functions  in  3  1  have 

support  Xj  while  those  in  3  ^  have  support  *  X^Xp  Let  e^  be  the 
indicator  function  of  X^  and  the  indicator  function  of  X^.  We  assume 
that  e..  is  in  span  (  3^).  Consider  to  be  the  linear  space  of  positive 
functions  generated  by  3  i  and«4.,  (i  =  1,2),  and  £  the  space  generated  by 
and  For  any  Q  we  define  to  be  equal  to  Q  on 

X.  and  zero  on  X  S X. . 

Corollary  to  Theorem  2. 

Consider  £p  £j  and  Qp  Qg  as  defined  above,  then: 

PS(Q)  -P^Q,)  ♦P^lOp.  • 

A  rough  interpretation  of  the  corollary  is  that  if  the  constraints 
can  be  separated  into  disjoint  pieces,  then  the  I-projection  can  be  similarly 
separated. 

We  now  turn  to  some  illustrations  of  the  preceeding  theorems. 


iliK 
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Example  1  (continued).  A  3  *  3  table. 

Recall  that  we  have  a  table  of  observed  probabilities,  s,  and  a  design 

1  *1 

manifold  0^  defined  by  a  set  of  functions,  3  -  {f^,...,fj;}  and  constants 

j  i  3 

**  *  (a^,* • • ,a^}.  We  consider  an  arbitrary  starting  P.D.,  R  with 


r11 

r12 

r13 

r21 

r22 

r23 

'3, 

r32 

r33 

Z 

ij 


1. 


The  usual  IPFP  alternately  scales  the  rows  and  columns  of  r  to  have  the 
same  margins  as  the  observed  table.  Specifically,  the  row  adjustment  is; 

rU  '  <rij/|  rij>  ”  *  '•2-3- 

which  is  followed  by  the  column  adjustment. 


This  process  is  repeated  until  the  cell  estimates  converge.  Of  course  when 
In  (r)  e  Wf  ,  the  iterations  converge  after  just  one  row  and  column  adjustment. 
The  fitted  values  then  correspond  to  the  M.L.E.'s  for  the  loglinear  model. 

In  (p)  e  .  When  In  (r)  /Wj  the  iteration  need  not  converge  after  one 
cycle.  In  this  case  the  fitted  values  correspond  to  the  M.L.E.'s  for  the 
log  affine  model.  In  (p)  e  In  (r)  + 

Let  us  now  investigate  how  one  could  use  Csiszar's  theorem  to  calculate 
P&(R).  Consider  two  linear  spaces  of  P.O.'s,  and  &2,  defined  by: 
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-  {P.D. 's  P  s.t.  E  f£(ij)p(i,j)  =  a^,  i  *  1,2,3) 

^  J 

fi2  =  {P.D.'s  P  s.t.  E  ff(lj)pd.j)  -  af,  i  «  1,2,3). 

ij  u  L 

That  is  is  the  set  of  P.D.'s  whose  row  sums  agree  with  the  observed 
table  and  &2  consists  of  those  P.D.'s  whose  column  sums  agree  with  the 
observed  table.  As  £  =  n  the  I-projection,  Q  »  Pfi(R)  is  the  limit 
of 

Ql  -  P&W 

Q2=ps2^i) 

Q4=Pg2(Q3).  etc. 

where  Qn  =  P&  (Qn_-j),  i  =  n  mod  2. 

Thus  we  need  to  be  able  to  calculate  the  I-projections  onto  &j  and  £2. 

From  the  definition  of  I-projections,  satisfies 

KQJIR)  *  min  I(Q||R) 

1 

*  min  E  q..  In  (q../r..). 

Qefi,  ij  1J  1J 

Expanding  this  expression  leads  to  the  following  minimization  problem: 

min  q-|i  In  1 7^*1 1 )  +  q12  ^  ^q12^r12^  +  ^aR”qll~q12^  ^  ^R”^ll”^12^r13^ 

2  2 

+  q2i  In  (^21^21  ^  +  q22  ^q22^r22^ +  ^ ®R”q21”q22^  ^  ^aR""q21”q22^'"23^ 

+  q31  1n  ^q31^r3l ^  + q32  1n  ^q32/r32^  +  ^ aR_q31  “q32 ^  1n  ^aR‘q31'q32/r33^  ’ 
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where  q^,  q12,  q2].  *122’  q31  ’  q32  are  allowed  t0  vary  freely  over  (O.1)- 

If  one  takes  partial  derivatives  of  this  expression  with  respect  to 

qll *q12’* ' ' ,q32  and  equates  each  derivative  to  zero  one  obtains  the  equations: 

(i)  In  (q^/r^)  =  In  (<q12/r12^  =  ln  (aR  "  q11  '  q12/r13* 

(ii)  In  (<q2l/r21  ^  =  ln  *q22/r22*  *  ln  _  ^l  "  q22/r23} 

(iii )  In  ( ^21  ^ 31  ^  =  ^  ^q32^32^  ”  ^  ^aR  ”  q31  ”  q32^33^" 

By  removing  the  logarithms  one  sees  that  these  are  just  linear  equations 
whose  solution  is  found  by  scaling  the  rows  so  that  the  marginal  sums  are 
correct.  Analogous  equations  result  for  the  I-projection  onto  £2*  ThuS  ^or 
this  partition  of  the  space,  £,  the  Csiszar  algorithm  reduces  to  the  usual 
IPFP. 

The  particular  subdivision  of  the  space,  £,  is  not  the  only  one  pos¬ 
sible.  Consider  dividing  the  space  £  into  more  linear  spaces,  namely: 

61  =  (P.O.'s  P  s.t.  ZfJdP  *  ZfJdS  =  aj} 

62  *  (P.O.'s  P  s.t.  /f*dP  =  /fjjdS  *  a^) 

&6  *  (P.O.'s  P  s.t.  /f^dP  =  /fJdS  =  a^}. 

6 

As  £  *  n  a.,  Csiszar's  theorem  tells  us  we  can  find  the  I-projection,  Q  , 
i*l 

onto  £  by  cyclically  projecting  onto  £g.  Again  we  need  to  calculate 

each  of  the  elementary  I-projectlons.  For  example  we  need  Q1  such  that 

I(Q, I | R)  *  min  I(Q|)R).  If  we  write  the  expression  out  more  fully  we  require 
'  Q*  £] 


nrin  Qjj  In  +  ^12^r12^  +  ^aR*^ll"^12^  ^aR~^11~^12^r13^ 

+  q21  In  (q21/r2l)  +  q22  In  (q22/r22)  +  q23  In  (q23/r23)  ♦  q31  In  (q31/r31) 

+  q32  In  (q32/r32)  +  ( [1-ap] -q21-q22-q23-q31-q32) 

In  ( n-apl-q21-q22-q23-q31-q32/r33) 

where  the  minimization  is  over  q^,  q^2,  q^,  q22,  q23,  q3^  and  q32  all  in 
(0,1).  We  can  obtain  the  minimizing  as  we  did  before.  The  procedure  is, 

(i)  scale  the  first  row  of  r  so  that  it  has  the  correct  margin 
(ii)  scale  the  rest  of  the  table  so  that  the  sum  of  all  the  cells  is 
again  one. 

The  full  algorithm  then  cycles  through  rows  and  columns  one  at  a  time. 

This  algorithm  is  not  as  efficient  as  the  earlier  procedure,  as  we  need 
to  adjust  the  entire  table  at  each  iteration  and  only  one  of  the  (row)  margins 
is  necessarily  satisfied  at  the  end  of  each  iteration.  In  this  approach  we 
have  ignored  the  corollary  to  the  theorem  whereas  in  the  earlier  approach 
the  corollary  was  implicitly  invoked.  * 

Example  2.  Goodman's  Association  Models  for  Tables  with  Ordered  Categories. 

A  recent  article  by  Goodman  (1979)  presents  a  class  of  models  for  I  x  J 
contingency  tables  with  ordered  categories.  This  class  of  models  postulates  a 
structure  for  the  odds  ratio  (association)  in  the  2x2  subtables  formed  by  cells  in 
adjacent  rows  and  columns  of  the  table.  (It  is  well  known  that  the  odds  ratio  for 
any  2x2  subtable  can  be  recovered  from  the  odds  ratios  in  the  adjacent  subtables.) 
Goodman  presents  two  classes  of  models  for  the  table,  but  we  will  consider  only 
the  log! inear  version,  which  Goodman  denotes  as  Model  I. 


Consider  the  odds  ratio,  0^,  of  the  2><2  subtable  formed  by  the 

*  v 

intersection  of  rows  i  and  i+1  with  columns  j  and  j+1,  i.e., 

e  -  Pi.J  *  Pi+U+1 
1J‘Pi.j+l  *Pi+l.J  ’ 

The  model  we  wish  to  consider  asserts  that  e..  can  be  written  as  the  product 

*  J 

of  a  row  effect  and  a  column  effect,  i.e.. 


0  .  .  =  9  .  •  0  . 

1J  1.  .J 

or  In  (0..)  =  In  (0.  )  +  In  (0  .).  When  written  as  a  loglinear  model  for 

1  J  1  .  •  J 

the  expected  probabilities  this  becomes: 


In  (Pij)  =  Ui  +  8j  +  *  Uy 

We  now  describe  a  spanning  set  and  some  of  the  calculations  required  to  fit 
this  model  for  the  special  case  of  a  3  »  3  table. 

The  linear  manifold, 171 ,  for  this  model  is  spanned  by  a  set  of  tables, 
fR,  fQR,  fjj  and  f^c;  i,j  =  1,2,3.  The  subscripts  R,  OR,  C  and  OC  indicate 
that  the  vector  corresponds  to  Row,  Ordered  Row,  Column  or  Ordered  Column 
parts  of  the  model, while  the  superscript  indicates  the  row  or  column  number. 
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The  general  structure  is  that  (or  f£)  is  a  table  of  zeros  except  for  the  ith 
row  (jth  column)  which  contains  ones,  i.e.. 


fj(M)  - 


1 

0 


k  =  i 
k  f  \ . 


Similarly,  for  the  ordered  row  and  column  tables,  the  general  form  is 


fQC(k,0 


Jk-' 

1.0 


£  =  j 
£  t  j  • 


We  now  group  the  spanning  tables  into  sets  of  related  constraints.  Let 

Jr*  <4  4  :  1  *  1-2-3> 

and 

Jc  ■  <4  4  :  i  * 

We  also  need  the  corresponding  sets  of  constants,  c^and*^.  Generally 
these  are  determined  from  some  observed  table  of  probabilities.  The 
linear  spaces  of  P.O.'s  corresponding  to  these  constraints  and  constants  are: 


*  (P.O.'s  P  s.t.  JfJdP  s  a^;  A  =  0,0R;  i  =  1,2,3) 
&c  «  (P.O.'s  P  s.t.  JfjdP  *  agj  B  =  C,0C;  j  *  1,2,3). 


In  order  to  find  the  M.L.E.'s  of  cell  probabilities  for  this  model  we 
need  to  be  able  to  compute,  P^R;  for  some  suitable  R  and  £  =  £R  n  S^.  The 
theory  tells  us  that  this  I-projection  can  be  obtained  by  cyclically  pro¬ 
jecting  onto  £p  and  2^.  It  is  these  projections  which  we  now  compute.  The 
first  observation  we  should  make  is  that  each  of  the  spaces  is  generated  by 
three  pairs  of  functions  with  disjoint  support.  For  example,  2,^  is  generated 
by  the  pair  of  functions  (fR,  fJR),  (fR.  fgR)  and  ^fR’  f0R^  and  the  suPP°rt 
for  these  functions  is  respectively  the  first,  second,  and  third  rows  of 
the  space  of  tables.  Thus  we  can  apply  the  corollary  to  this  estimation 
problem. 

Consider  a  starting  table  (which  may  already  be  the  result  of  several 
iterations). 


Then  P„  (R)  is  the  probability  function,  p,  which  minimizes 

K 

(2.10)  £  pu  In  (Pfj/rfj) 

subject  to  p  being  in  2^.  By  applying  the  corollary  we  can  separately 
minimize  (2.10)  for  each  i  (i.e.,  row)  and  combine  the  results.  For  i  *  1 
we  need  to  minimize, 

(2.11)  (aj  -  aJR  +  p13)  In  (aj  -  aJR  +  P13/rn)  +  (aJR  -  2p13)  In  (aJR-2p]3/r 

+  P13  In  (P13/r]3). 
subject  to  p^3  e  (0,1 ). 
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By  taking  the  derivative  of  (2.11)  with  respect  to  and  equating 
it  with  zero  we  obtain  the  equation: 

(2.12)  In  (aR  -  agR  +  P^/r^)  -  2  In  (aJR  -  2p13/r12)  +  In  ^P1 3^rl 3^  “  °» 


which  can  be  written  as 

(aR  ‘  a0R  +  p13^  *  (p13^  *  (r12)2 


(a0R  ‘  p13^  r 
or  equivalently  as. 


*  1. 


11 


13 


_2  „  rn*ri3x  .  _  ,j  j  A  rn‘ri3^ 
p13^  2  *  P1 3taR  '  a0R  2a0R  ~2  * 


12 


12 


rn;r13  {al  \2  . 0 

2  va0Rj  u. 
r12 


This  equation  is  relatively  easy  to  solve  for  p13.  The  estimates  for  p^  and 
P12  are  derived  by  solving  the  constraint  equations.  The  equations  from  the 
second  and  third  rows  and  the  columns  are  analogous. 

If  we  consider  the  same  class  of  models  for  I  x  J  tables  where  one  of 
I  or  J  is  greater  than  3,  the  resulting  equations  are  systems  of  higher  order 
polynomials.  Clearly,  solving  such  systems  may  themselves  be  a  difficult 
task. 


In  the  next  section  we  will  show  another  algorithm  that  can  be  used  for 
this  problem.  ■ 


The  preceeding  examples  have  used  an  I-divergence  approach  to  the  IPFP. 
We  now  consider  the  approach  discussed  by  Haberman  (1974,  p.  64).  That 
discussion  uses  the  method  of  co-ordinate  cyclic  ascent  to  directly  maximize 
the  likelihood.  A  fixed  set  of  vectors  which  span  the  model  space, 7JI,  is 
chosen  and  the  likelihood  is  maximized  along  each  of  these  directions  in 


turn.  Specifically,  consider  a  set  of  vectors  1  =  if  :  y  «  r}  which  span 
/T»l ,  denote  the- log- likelihood  by  i( p-,s)  and  consider  an  initial  estimate  p° 
with  In  (p1^)  in  /Hl.  The  algorithm  proceeds  by  finding  p1  such  that 

In  (p1)  =  In  (p1"1)  +  oifY;  i  ■  y  mod  |r|, 

where  a.  is  determined  so  that 

|«i  •  <t(pi-1 ;s),fY>  |  <b*|t(p’;s)  -  t(p1-1;s)| 

for  some  fixed  b  in  (0,1).  Generally  we  would  find  c^.  by  attempting  to 
maximize 

(2.13)  i(exp  {In  (pi_1)  +  o^)). 

This  is  a  one-dimensional  maximization  problem  in  the  fixed  direction,  f  . 

This  problem  can  be  re-expressed  as: 

(2.14)  maximize  ^(p1), 
subject  to. 

In  (p1)  «  In  (p1-1)  +  span  (fy), 

which  Is  a  maximum  likelihood  problem  for  a  log  affine  model.  Csiszar's 

Theorem  3.1  showed  that  this  has  a  dual  representation  as  a  minimum  I-diver- 

gence  problem.  But  the  solution,  p1 ,  to  (2.14)  is  not  necessarily  a  P.D., 

even  though  p“  *  lim  p1  must  be  a  P.D.  To  rectify  this  situation  we  consider 
i-*<» 


the  related  problem: 
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(2.15)  maximize  ^(p1), 
subject  to. 

In  (p1)  e  In  (pi_1)  +  span  (f^.e), 

where  e  is  the  vector  of  all  ones.  The  dual  representation  of  this  problem 

is  to  consider  £  *  {P.D.'s  P  s.t.  I f  dP  =  [f  dPi_^}  and  then  form  p*  as 
Y  J  Y  J  Y 

P£  (P^).  Thus,  in  a  certain  sense,  the  co-ordinate  cyclic  ascent  methods 
Y 

are  conjugate  dual  problems  to  the  algorithms  of  Csiszar.  It  appears  that 
the  cyclic  ascent  method  is  easier  to  work  with  as  it  does  not  require  the 
result  of  each  iteration  to  be  a  P.D. 

If  we  use  the  duality  result  the  other  way  around,  we  can  describe  the 
I-projections  in  an  alternate  form.  For  example,  consider  the  linear  space 
£  *  &I  n  £2  generated  by  1 1  u  1  ^  and  suppose  we  wish  to  calculate  P£  (Q) 
for  an  arbitrary  P.D.  Q.  The  dual  to  this  problem  is  to 

maximize  H(p) , 

subject  to. 

In  (p)  «  In  (q)  +  span  (^.e). 

Thus  another  type  of  IPFP  maximizes  the  likelihood,  not  in  a  set  of  fixed 
directions,  but  in  a  set  of  planes  spanned  by  .  ,e).  If  R  is  a  starting 
vector  such  that  In  (r)  e 0*  and  IK  is  equal  to  span  ( 1  ] ,  1 2*. . . ,  ?k)  where 
the  3-  i  are  not  necessarily  1  dimensional  subspaces  then  the  following 
algorithm  converges.  Form  £j  .  =  span  (U  .,e)  and  P°  =  R.  Then  pi+1  is 
the  p  which  maximizes  t(p)  subject  to  In  (p)  is  in  span  (^.),  where 

=  ^  j  ^  i  s  J  mod  *<•  If  each  of  these  problems  is  easy  to  do  then 
this  may  form  a  useful  algorithm. 
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3.  The  Darroch  and  Ratcliff  Generalized  Iterative  Scaling 

The  proceeding  section  has  shown  how  Csiszar's  Theorems  and  the  "dual" 
theorems  of  Haberman  may  be  cyclically  applied  to  compute  I-projections 
and  maximum  likelihood  estimates.  A  paper  by  Darroch  and  Ratcliff  (1972) 
attacks  the  same  problem,  again  by  looking  at  it  from  the  information  theory 
point  of  view;  however  their  generalization  of  the  IPFP  is  different  from 
those  we  have  thus  far  considered.  Darroch  and  Ratcliff  (D  &  R)  succeed  in 
calculating  the  total  I-projection  without  necessarily  calculating  any  of 
the  marginal  I-projections.  In  the  "usual"  case,  i.e.,  where  the  space,  £, 
is  generated  by  vectors  containing  only  zeros  and  ones  their  generalization 
also  reduces  to  the  conventional  IPFP  algorithm. 

The  0  &  R  algorithm,  or  Generalized  iterative  Scaling  (GIS),  ensures 
that  at  each  iteration  equation  (2.8)  is  satisfied  (i.e.,  qR(x)  =  c.  exp  (g(x)), 
where  g  is  in  the  linear  space  spanned  by  the  constraints).  When  the 
algorithm  has  converged  one  is  able  to  show  that  the  fitted  values  also 
satisfy  the  marginal  constraints.  This  should  be  contrasted  to  the  algorithms 
we  have  discussed  earlier.  The  algorithms  of  Csiszar  and  Haberman  alternately 
satisfy  the  marginal  constraints,  with  only  the  final  fitted  values  neces¬ 
sarily  satisfying  equation  (2.8).  We  now  consider  the  D  &  R  algorithm  in 
more  detail. 

Let  1  =  (f  :  y  «  D  be  a  set  of  constraints  with  corresponding  con¬ 
stants  <^=  (a^  :  y  e  r}  and  consider  the  linear  set  of  P.D.'s 

£  =  (P.D.'s  P  ;  ff  dP  =  a  ;  y  e  r). 

J  y  y 

As  before,  £  is  just  the  set  of  P.D.'s  whose  ^ -margins  are  correct.  It 
is  always  possible,  as  0  4  R  show,  to  find  a  set  of  vectors  =  (g^  :  6  e  a} 
whose  span  is  the  same  as  span  (^  )  and  which  satisfy 
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(3.1)  g5  _>  0  V 6  and  z  (g{)  =  e. 

6e  & 

In  this  formulation, 

£  =  (P.O.'s  P  :  Jg6dP  =  b6;  6  e  l) 

where  the  br  are  determined  from  the  a  .  We  now  consider  the  problem  of 
6  y 

finding  the  I-projection  of  some  R  onto  £,  i.e.,  we  wish  to  find  a 
P  e  £  such  that 

p  =  r  •  exp{  Z  x.  •  g.) 

6  6 

where  the  x_  are  to  be  determined.  D  &  R  show  that  the  following  algorithm 
<5 

converges  to  the  M.L.E.: 

(i)  set  p°  =  r 

(ii)  set  pn+1  =  pn  •  JI  { (b  /<g  ,pn>) 

6  6 

=  pn  exp{  z  g  •  In  (b  /<g  ,pn>) 

6e  A 

where  6  -  n  mod  |a|.  The  algorithm,  as  given,  adjusts  for  all  of  the  marginal 
constraints  at  once.  However,  it  is  possible  to  adjust  for  several  sets  of 
simultaneous  constraints,  one  set  at  a  time  using  partitions  not  unlike  those 
those  discussed  in  Section  2. 

Consider  two  linear  spaces  of  P.D.'s,  £-|  and  defined  by  constraint 
sets  3  j  and  3  each  of  which  satisfies  equation  (3.1)  on  its  support. 
Csiszar's  results  suggest  that  to  calculate  the  I-projection  of  Qq  =  R 
onto  £  *  £^  n  £?  one  should  successively  form 

Qn  =  (0n_i ) »  i  =  n  mod  2. 

The  GIS  algorithm  would  be  one  way  of  calculating  these  elementary  I-projec- 
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tions.  Darroch  and  Ratcliff  suggest  an  alternative  approach  which  does  not 
necessarily  involve  calculating  the  individual  I-projections.  The  idea  is 
to  perform  one  iteration  of  the  previous  algorithm  on  the  space  £.|,  then 
one  iteration  on  £  ^  and  continue  cyclinq.  If  we  let  =  { g ^ ;  6  e  a.j  } 

and  =  (g^ ;  6  e  Aj),  and  let  (b  ;  6e  a^  }  and  {b^;  6e  A2 )  be  the  associated 
constraint  spaces  then  the  algorithm  would  proceed  as  follows: 

(i)  set  p°  »  r 

4,  . 

(ii)  set  pn+1  =  pn  •  n  ((bV<g’,pn>)  ).  where  i  =  n  mod  2. 

5=1 


To  illustrate  the  ideas  presented  here  we  reconsider  Example  2. 

Example  2  (continued). 

We  illustrate  one  of  the  ways  that  the  6IS  algorithm  can  be  used 
to  find  the  M.L.E.'s  for  Goodman's  association  model  in  a  3  *  3  table. 
Recall  that  the  constraints  came  in  natural  pairs  (e.g.,  fjij  and  fgR)  of  a 
row  (column)  and  ordered  row  (column)  function'.  These  pairs  do  not  satisfy 
equation  (3.1)  on  their  support,  but  we  can  convert  them  into: 

9R  *  fR  “  1  f0R  and  g0R  3  I  f0R 


which  still  span  the  same  space.  We  also  need  to  make  a  similar  adjustment 
to  the  constants,  viz: 


J  =  J  I  J 

°R  aR  2  a0R 


and  bQR  =  \  a0R* 
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Analogous  transformations  are  made  to  the  columns.  As  all  the  pairs  of 
constraints  are  similar  we  will  concentrate  on  the  pair  corresponding  to 
the  first  row  and  consider  only  one  step  of  the  algorithm.  We  note  that 


If  Qn  is  our  current  approximation  to  Pg(R)  then  the  GIS  algorithm  would  form 
as  its  next  estimate. 


The  algorithm  continues  by  considering  each  of  the  constraint  sets  in 
turn.  In  this  example  we  sometimes  need  to  take  square  roots  of  the  ratio 
of  the  observed  margin  to  the  expected  margin.  In  a  more  typical  situation 
we  would  take  arbitrary  powers  rather  than  just  square  roots. 

Note  that  applying  the  same  adjustment  to  the  new  table  (i.e.,  not 
cycling  through  the  pairs  of  constraints)  produces  another  new  table.  If 
we  were  to  continue  with  the  same  pair  of  constraints  we  would  arrive  at 
the  I-projection  onto  that  constraint  space.  Thus  in  many  respects  GIS 
just  combines  the  first  steps  of  an  algorithm  to  compute  I-projections.  ■ 
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The  GIS  algorithm  is  a  method  which  is  conceptually  easy  to  compute 
and  guaranteed  to  converge.  Unfortunately  the  algorithm  is  also  known  to 
converge  very  slowly  in  some  situations.  In  contrast,  the  Csiszar  approach 
is  appealing  as  it  maximizes  along  a  fixed  space  at  each  step  but  it  has 
the  disadvantage  that  the  elementary  I-projections  may  themselves  be  dif¬ 
ficult  to  compute  or  require  iteration.  Which  procedure  (or  combination) 
is  better  may  depend  on  the  problem  under  investigation  but  certainly 
requires  further  study. 


4.  Methods  for  Maximum  Likelihood  Estimation  in  Special  Cases 

In  this  section  we  shall  use  the  ideas  of  the  preceeding  sections  to 
study  some  problems  in  which  the  constraints,  3"  ,  have  a  special  structure. 
We  consider  as  examples  the  ordered  categories  model  for  a  3  *  3  table  Intro 
duced  earlier  and  a  special  situation  considered  in  Fienberg  and  Wasserman 
(1980).  In  both  examples  we  will  find  it  edifying  to  expand  the  table  (i.e. 
increase  the  number  of  cells)  and  fit  a  transformed  model  to  the  larger 
table.  Clearly  we  will  need  some  conditions  on  the  model  and  how  we  "expand 
the  table.  The  following  "theorem"  is  a  collection  of  conditions  which  we 
will  need  to  verify  in  the  examples.  In  general  verifying  the  conditions 
may  itself  be  a  difficult  task. 

Theorem  3 

Let  g  be  a  one  to  one  mapping  of  the  P.D.'s  on  a  set  X  into  the  P.D.'s 
on  a  set  X*.  If  £  is  a  linear  set  of  P.D.'s  on  X,  then  define  g(£)  = 

(g(P)  :  P  e  £}.  Let  £*  be  a  linear  set  of  P.D.'s  on  X*  such  that  g(£)  C  £*. 
If  g  is  such  that 

(4.1)  I ( P | | Q)  *  k-I(g(P)i|g(Q))  for  P.Q  e  £ 


and  Pg*(g(R)) e  g(£)  then 

PS(R)  *  g'1(Ps*(g(R))).  ■ 

This  theorem  allows  us,  under  certain  conditions,  to  calculate  the 
I-projections  in  a  transformed  problem  and  then  invert  the  transformation  to 
obtain  the  I-projection  in  the  original  setting.  There  are  at  least  two 
ways  of  using  the  theorem.  In  some  situations  it  may  be  possible  to  define 
the  linear  set  £*  so  that  g(£)  *  £*.  This  is  the  easier  case  and  it  essen- 


tially  just  relabels  the  problem.  However  even  such  simple  relabeling  can 
be  helpful  if  .it  helps  one  to  interpret  the  model  or  recognize,  say,  a 
model  in  the  transformed  space  for  which  closed  form  estimates  are  known  to 
exist.  The  second  application  of  the  theorem  requires  more  work  to  verify 
the  conditions,  but  is  also  more  generally  applicable.  Here  we  take 
a  linear  set  £*  which  is  much  larger  than  g(£),  we  need  to  prove 
that  Ps*(g(r))  e  g(&).  In  other  words,  even  though  £*  contains  g(£) 
we  need  to  show  that  for  any  g(R),  the  I-projection  onto  £*  is  always 
an  element  of  g(£).  For  a  particular  set  of  data  it  may  be  easy  to 
verify  this  condition.  All  we  need  do  is  fit  the  transformed  model  and 
see  if  the  I-projection  is  in  g(£).  To  prove  this  type  of  result  for 
a  general  class  of  R's  and  &'s  is  much  more  difficult.  These  ideas  are 
best  illustrated  in  the  context  of  two  examples. 

Example  2  (Continued) 

We  have  previously  shown  that  the  row  and  column  constraints 
can  be  considered  in  pairs  and  each  of  the  pairs  of  constraints  can  be  indi¬ 
vidually  fit.  Thus  if  (w-i^.Wj)  are  the  current  fitted  values  for,  say, 
the  first  row,  we  need  to  adjust  this  triple  so  that  its  row  and  ordered 
row  margins  match  some  specified  constants. 

Let  be  the  set  of  positive  triples  which  satisfy  the  row  and  ordered 
row  constraints  for  the  first  row,  i.e., 

*  {positive  triples,  q  :  2q1  +  q2  *  2aR  -  3qR  *  a3 
and  q2  +  2q3  =  aJR  =  a4>. 

As  the  vector  e  *  (1,1,1)  Is  in  the  span  of  the  space  of  constraints  which 
defines  we  can  apply  the  corollary  of  Cslszar's  Theorem  3-2  and  just 
work  wlthPg  (W).  Now  consider  the  function 
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W1 

1 

2  w2 

1 

7  *2 

*3 

and  define 


*  g^) 

*  { 2x2  tables 


such  that  a  +  bsa  +  c  =  2'  a3  an(1 
d+  c=  d+  d  =  ^  a^}. 


Note  that  the  constraints  on  £*  Imply  that  b  equals  c  which  means  that  g'1 
is  well  defined  on  &*.  It  is  not  a  difficult  calculation  to  verify  that 
I ( Q | )W)  =  I (g (Q) | | g(W) ) .  Our  theorem  now  allows  us  to  calculate  P£  (W)  as 
g"Vfi*(g(W)). 


The  constraints  which  define  &*  are  just  simple  row  and  column  margins 
Thus  the  I-projection,  P£*(g(W)),  can  be  calculated  by  the  usual  IPFP  (i.e. 
adjusting  row  and  column  margins),  or ,  as  it  in  a  2 x 2  table,  by  direct 
calculation.  As  the  logarithms  of  the  starting  values,  w,  do  not 
necessarily  satisfy  any  model,  the  IPFP  will  in  general  require  several 


iterations  to  converge.  Thus  to  obtain  the  I-projection,  P»(Q) , 

*R  n 


where  is  the  space  of  P.O.'s  which  satisfy  all  of  the  row  constraints, 
we  could  transform  each  row  of  the  3>«3  table  into  a  2  x  2  table,  cal¬ 


culate  with  the  2  x  2  table  and  then  use  g"1  to  return  a  triple  of  fitted 


values.  The  approach  for  the  columns  would  be  similar. 
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There  is  another  g,  which  transforms  the  entire  3*3  table  into  a 
2  *  2  *  2  *  2  table.  In  this  case  £*  =  g(£)  becomes  the  model  of  no  fourth 

4 

order  interaction  for  the  2  table.  Specifically, 


a 

b 

c 

d 

e 

f 

9 

h 

i 

a 

1  b 

c 

1  f 

}. 

9 

1  h 

i 

It  is  not  difficult  to  check  that  the  model  of  no  fourth  order  interaction 
corresponds  to  g(£)  and  that  I(PjjQ)  =  I (g(P) |jg(Q}) .  Therefore  the  usual 
IPFP,  with  starting  values  g(e)  and  the  model  of  no  fourth  order  interac¬ 
tion  applied  to  9(0n)  will  yield  a  24  table  of  fitted  values  which  can  in 
turn  be  transformed  (by  g”^)  into  a  3  x  3  table  for  the  original  problem.  ■ 


Both  applications  of  the  theorem  in  the  previous  example  used  an  £* 
which  was  equal  to  g(£).  The  following  example  gives  a  situation  where  £* 
is  much  larger  than  g(£).  Here  we  need  some  trickery  to  show  that  the  I- 
projection  of  g(R)  onto  £*  is  in  g{£). 

Example  3. 

Fienberg  and  Wasserman  (1980)  describe  a  class  of  log! inear  models 
for  some  multivariate  directed  graphs.  Their  paper  considers  as  an 
example  a  set  of  data  concerning  the  interrelationships  between  73  or¬ 
ganizations  in  a  small  community.  Three  types  of  relations  were  observed 
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for  each  of  the  pairs  of  organizations,  but  for  simplicity  we  restrict 
our  attention  to  two  of  these  criteria,  support  and  money.  For  each 
criterion  the  organizations  were  asked  to  respond  to  the  questions: 

Ci )  to  which  organizations  do  you  give  support  (money)? 

(ii)  from  which  organizations  do  you  receive  support  (money)? 

A  particular  directed  relationship  (i.e.,  giving  or  receiving)  is  regarded 
to  be  present  if  either  or  both  the  organizations  in  a  pair  perceived  the 
relationship.  For  each  pair  of  organizations  it  is  possible  to  construct 
a  four-vector  of  zeros  and  ones  indicating  the  presence  or  absence  of 
(support  out,  support  in,  money  out,  money  in).  Consider  for  a  moment 
just  the  support  relationship.  A  pair  of  organizations  are  said  to  have 
a  Mutual  relationship  if  they  support  each  other  (i.e.,  (support  out, 
support  in)  =  (1,1)),  a  Null  relationship  if  neither  supports  the  other 
(i.e.,  (0,0)),  or  an  Asymmetric  relationship  if  support  is  unreciprocated 
(i.e.,  (0,1)  or  (1,0)).  If  we  aggregate  over  all  =  2628  pairs  of 
organizations  there  are  ten  distinguishable  support-money  relationships, 
namely: 


MM  with  four  vector 

MA 

MN 

AM 

AA 

A* 

AN 

NM 

NA 

NN 


(l.l.l.l) 

(1,1, 0,1)  or  (1,1, 1,0) 

(1, 1,0,0) 

(0,1, 1,1)  or  (1,0, 1,1) 
(0, 1,0,1)  or  (1,0, 1,0) 
(0,1, 1,0)  or  (1,0,0, 1) 
(0,1 ,0,0)  or  (1,0,0,0) 
(0,0,1, 1) 

(0,0,1,0)  or  (0,0,0,1) 

(0,0,0,0) 
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Notice  that  when  both  relationships  are  asymmetric  there  are  two  different 
cases,  corresponding  to  whether  the  relationships  flow  the  in  the  same  or  in 
different  ways.  We  denote  the  table  of  observed  probabilities  by  Z  where 


for  example  zMM  is  the  number  of  mutual-mutual  relationships  divided  by 


(?)  • 


The  table  is  represented  by 


MONEY 


An  alternate,  though  somewhat  deceptive,  description  of  the  data  is 


to  consider  four- vectors  for  each  of  the 


x  2  ordered  pairs  of 


organizations  and  to  aggregate  this  into  a  24  table,  Y  s  y^^,  ijk*  s  lt2, 
where  a  1  indicates  the  presence  of  a  flow  and  a  2  indicates  the  absence 


of  a  flow.  Thus  ynn  Is  the  number  of  mutual-mutual  relationships 
divided  by  5256.  The  Y  table  duplicates  certain  relationships  and  gives 


double  weight  to  certain  others.  The  Y-table  has  the  form. 
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money  out  1  2 

money  in  1  2  1  2 

supp  out  supp  in 


We  now  consider  one  of  the  models  for  the  Z-table  considered  by 
Fienberg  and  Wasserman.  (The  same  arguments  work  for  all  of  their  models.) 
The  model  takes  as  a  linear  space,  £,  of  P.D.’s  the  set  of  tables,  S, 
which  have  margins  Sa+  and  S+jJ,  a,b  *  M,A,N,  which  are  the  same  as  the 
corresponding  margins  for  the  Z-table.  For  example  we  require 

Sa+  =  SAM  +  SAA  +  SAA  +  SAN  =  ZAM  +  ZAA  +  ZAA  +  ZAN  ZA+* 

This  model  can  be  fit  directly  to  the  Z-table  using  the  methods  of  the 
previous  sections.  As  the  model  space  can  be  spanned  by  vectors  consisting 
of  only  0's  and  1's,  both  the  D  &  R  and  Csiszar  algorithms  reduce  to  the 
same  simple  scaling  algorithm  which  takes  an  initial  table  of  all  l's 
and  successively  adjusts  the  row  and  column  "margins"  to  match  those  in 
the  observed  table.  This  algorithm  is  easy  to  do  by  hand,  but  because 
the  Z-table  Is  not  square  (i.e.,  it  has  10  cells  rather  than  the  9  one 
would  expect),  and  consequently  has  an  extended  Interpretation  of  marginal 
totals,  standard  IPFP  computer  programs  would  not  be  able  to  analyze  this 
table.  Moreover,  for  some  of  the  models  considered  by  Fienberg  and  Wasserman 
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the  models  are  not  so  simple  and  the  computations  on  the  2-table  require 
the  full  power  of  the  generalized  IPFP's.  For  this  reason  we  prefer  to 
work  with  a  transformed  problem,  where  the  sufficient  statistics  for 
the  models  can  be  represented  by  simple  marginal  totals. 

Consider 


N 

CM 

2ma 

ZMA 

2zmn 

2  AM 

ZAA 

zaA 

ZAN 

ZAM 

ZA A 

ZAA 

ZAN 

2znm 

ZNA 

ZNA 

2znn 

*  Y 

4 

which  maps  the  Z-table  into  the  2  Y-table.  We  denote  the  factors  support 
Cout,  in),  money  (out,  in)  by  the  numbers  1,  2,  3,  and  4.  It  is  now  easy 
to  see  that  the  marginal  sums  considered  for  the  Z-table  can  all  be  found 
(twice)  in  the  [12]  and  [34]  margins  of  the  Y-table.  Also  note  that  the 
Y-table  has  a  strong  symmetry,  y.jkJl  =  ^ Vijkfc.  Now  g(£)  is  just 

the  set  of  tables  which  have  (i)  the  correct  [12]  and  [34]  margins  and 
(ii)  preserve  the  observed  symmetry  in  the  Y-table.  Consider  just  the 
first  of  these  conditions  ignoring  the  symmetry  constraint.  It  is  this 
model  which  we  shall  consider  to  be  &*.  As  we  have  relaxed  some  condi¬ 
tions  it  is  clear  that  g(fi)  c  &*. 

It  Is  convenient  now  to  explicitly  define  the  space  &*  and  the 
conditions  we  need  to  verify  to  show  that  P&*(g(R))  is  in  g(£).  Consider 


A 
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and  constants  J  =  {a^,...,ag>  where  a^  =  <fj,g(Z)>.  Note  that  a2  =  a3 
and  ag  =  a7.  We  define  £*  to  be  the  space  of  P.D.'s  defined  by  %  and 
el  .  Now  consider  the  symmetry  transformation: 


h  :  yijkt*yjUk- 

ForPg*(g(R))  to  be  in  g(£)  we  require 

M  Pfi*(g(R))  *  Ps*(g(R)>. 


It  is  possible  to  assert  this  because  the  space  £*  is  invariant  under  h. 


Specifically  h( f^ )  =  f.  for  i  =  1,4, 5, 8  and  h(f2)  *  f3,  h(f3)  =  f2,  h(f7)  *  fg 
and  h(fg)  =  f7-  Because  a2  =  a3  and  ag  =  a7  the  linear  space  h(£*)  generated 
by  hC^)  and  h(^)  is  the  same  as  £*.  We  also  note  that  h(g(R) )  s  g(R), 


because  of  the  nature  of  g  function.  That  is  the  starting  values  necessarily 


satisfy  the  symmetry  constraints.  Now  let 
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/N 

Q“Pj*(g(H))  and 
Q  =Ph(&*)(h(g(r)))  -Pfi*(g(R)). 

*•  A 

But  note  that  Q  =  h(Q)  as  all  we  have  done  is  relabeled  the  co-ordinates. 
Thus 

Q  =  Q  =  h(Q) , 

i.e.,  the  fitted  P.0,  is  (i)  invariant  under  h  and  (ii)  is  in  2*.  Thus 
Q  is  in  g(2)  and  g  (Q)  is  the  fitted  P.D.  in  the  space  of  Z-tables. 

For  any  of  the  other  models  considered  by  Fienberg  and  Wasserman, 
it  is  easy  to  show  that  the  space,  £*,  is  invariant  under  h  and  thus 
the  above  argument  still  works.  ■ 

Both  the  examples  of  this  section  have  shown  situations  where,  for 
reasons  of  computational  ease,  it  was  desirable  to  transform  a  contingency 
table  into  a  related  but  larger  table.  In  the  transformed  table  it  was 
possible  to  fit  a  model  using  the  standard  IPFP  whereas  in  the  original 
table  the  corresponding  model  would  have  required  a  more  complicated 
algorithm.  This  approach  of  using  transformed  tables  is  especially 
important  in  practice  as  versions  of  the  standard  IPFP  are  widely 
available  and  easy  to  use.  An  additional  bonus  which  can  sometimes  be 
found  in  the  transformed  table  is  the  existence  of  closed  form  maximum 
likelihood  estimates.  The  theory  about  when  closed  form  estimates  exist 
in  complete  tables  with  factorial  models  is  well  known  and  such  situations 
are  easily  recognized.  On  the  contrary,  when  a  table  is  incomplete  or 
has  a  more  complicated  structure,  very  little  is  known  about  the  existence 
of  closed  form  estimates.  The  insight  gained  from  looking  at  the  trans¬ 
formed  table  may  also  assist  in  interpreting  the  models. 
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Appendi x  1 .  Analogies  with  Hilbert  Space 

In  this  appendix  we  discuss  some  of  the  analogies  between  the  IPFP  and 
methods  for  cyclically  calculating  projections  in  Hilbert  spaces. 

Consider  (V, <•,•>)  to  be  a  finite  dimensional  Hilbert  space  and  let 
£,,..., 8k  be  linear  subspaces  of  V  with  corresponding  orthogonal  projections 
Ei,...,E|(.  In  other  words  the  orthogonal  projection  of  a  vector  v  t  V  onto 
£.  will  be  denoted  by  E^v.  The  following  theorem  can  be  shown  to  be  true. 


Theorem  A1 . 1 


If  (V, <•,•>)  is  a  finite  dimensional  Hilbert  Space  and  are 

r  subspaces  of  V  t 
k 

£  =  n  £.  is  equal  to 
i=l  1 


linear  subspaces  of  V  then  the  orthogonal  projection  of  any  v  «  V  onto 
k 


Hm  {(Ek  •  Ek_1  •  ...  •  E.j )m  v).  ■ 
riH<» 

A  simple  extension  of  this  result  states  that  if  Qn  is  defined  to  be  the  pro¬ 
jection  of  Q  ,  onto  £  ,  where  £  =  £.  when  i  =  n  mod  k,  and  Q_  =  v  then 
n- 1  n  n  l  0 

Qn  converges  to  the  projection  of  v  onto  £.  This  is  a  direct  analogue  of 
Csiszar's  Theorem  3.2.  I  am  not  sure  if  the  above  theorem  is  always  true 
when  ( V , < • ,♦>)  is  an  infinite  dimensional  Hilbert  space,  but  it  Is  true 
when  any  of  the  £..  are  finite  dimensional.  There  is  however  a  version  of 
cyclically  projecting  onto  subspaces  which  is  always  true  (for  a  proof  see 
Von  Neumann  (1950)). 


Theorem  A1 .2. 


If  (V, <•,•>)  is  a  Hilbert  space,  the  orthogonal  projection  of  any  v  e  V 
k 


onto  £  =  n  £.  is  equal  to 
1=1  1 


lim  ( E i  •  E£  •  Ej 
rtH*> 


Ek-1  *  Ek  *  Ek-1 


El> 


m 


v. 


•  •  • 


In  this  version  of  the  theorem  we  use  a  symmetric  form  of  the  operator. 
Again  it  is  true  that  the  piecewise  projections  (in  the  correct  order)  con¬ 
verge.  The  advantage  of  Theorem  A1.2  is  that  powers  of  symmetric  operators 
generally  converge  more  quickly  than  do  powers  of  unsyrrmetric  ones. 

The  proof  of  Csiszar's  Theorem  3.2  can  easily  be  modified  to  prove  the 
symmetric  version  of  that  theorem.  Arguing  by  analogy  with  Hilbert  spaces 
we  conjecture  that  a  symmetric  form  of  the  IPFP  may  converge  more  quickly 
than  the  usual  version.  This  conjecture  needs  to  be  numerically  investigated. 
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Generalized  Iterative  Scaling;  I-divergence;  Kul lback-Leibler  Information 
Number;  Contingency  Tables 
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The  IPFP'  can  be  viewed  as  a  method  for  maximizing  the  likelihood  for  certain 
loglinear  models  or  equivalently  for  minimizing  the  Kullback-Leibler  Informa¬ 
tion  between  two  probability  densities.  Both  of  these  viewpoints  lead  to 
natural  generalizations  of  the  classical  1PFP.  We  examine  the  generalizations 
suggested  by  the  work  of  Csiszar  (1975),  Darroch  and  Ratcliff  f 1972}, -and  " 
ilaberman  (1974)^and,  with  Che  aid  of  the  theory,  explore  a  practical  example 
of  uxpanding  a  contingency  table. 
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